The Ford GoBike trip dataset is the one we have chosen for exploration in this project. It includes individual information about the rides made in a bike-sharing system covering the greater San Francisco Bay area. The dataset contains trip data for February 2019. Our primary goal is to systematically explore this dataset, starting from plots of single variables and building up to plots of multiple variables. We will then produce a short presentation that illustrates some of the properties, trends, and relationships we discover in the dataset.
The following are the questions of consideration in our exploration.
What is the distribution for members' ages?
What is the distribution for member_gender and user_type features?
What is the distribution for the trip duration in minutes?
During what period of the day are more trips likely to be booked?
What are the top 5 popular start stations for the trips taken?
For each gender, how long, in minutes, does the trip last?
Is there any correlation between Member age and duration of trip?
What is the relationship between age and user_type?
What is the relationship between the 3 categorical variables period_ofday, user_type and member_gender?
For each period of the day, what is the average trip duration in minutes for each user type?
What is the relationship between member_gender, age and duration_min?
How closely correlated are the different variables in the dataset?
What is the relationship between member_gender, age and user_type?
# importing all the necessary packages and set plots to be embedded inline.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import calendar
import plotly.express as px
import time
%matplotlib inline
# Loading the Ford Gobike dataset
bike_df=pd.read_csv('201902-fordgobike-tripdata.csv')
# displaying the first 5 rows of the Ford GoBike dataset.
bike_df.head(5)
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.1450 | 2019-03-01 08:01:55.9750 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984.0 | Male | No |
| 1 | 42521 | 2019-02-28 18:53:21.7890 | 2019-03-01 06:42:03.0560 | 23.0 | The Embarcadero at Steuart St | 37.791464 | -122.391034 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 2535 | Customer | NaN | NaN | No |
| 2 | 61854 | 2019-02-28 12:13:13.2180 | 2019-03-01 05:24:08.1460 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972.0 | Male | No |
| 3 | 36490 | 2019-02-28 17:54:26.0100 | 2019-03-01 04:02:36.8420 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989.0 | Other | No |
| 4 | 1585 | 2019-02-28 23:54:18.5490 | 2019-03-01 00:20:44.0740 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974.0 | Male | Yes |
# displaying the shape of the dataset
print(bike_df.shape)
(183412, 16)
The dataset contains $183,412$ rows and $16$ columns.
# checking the datatypes of the dataset
print(bike_df.dtypes)
duration_sec                 int64
start_time                  object
end_time                    object
start_station_id           float64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
bike_id                      int64
user_type                   object
member_birth_year          float64
member_gender               object
bike_share_for_all_trip     object
dtype: object
From the datatypes displayed above, we see that our dataset contains 3 types of datatypes namely:
int64
object
float64
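We observe that some of the data types are inappropriate and have to be changed. For example, start_time is stored as object rather than datetime, the station and bike IDs are stored as numbers although they are identifiers, and member_birth_year is stored as float64. These are converted further below, once the missing values have been handled.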
# checking some general information about the dataset.
bike_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             183412 non-null  int64
 1   start_time               183412 non-null  object
 2   end_time                 183412 non-null  object
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64
 12  user_type                183412 non-null  object
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object
 15  bike_share_for_all_trip  183412 non-null  object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
From the output above, we note that the data contains missing values.
# checking the features/columns with missing values.
bike_df.isna().sum()
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
We observe that the start_station_id, start_station_name, end_station_id, end_station_name, member_birth_year and member_gender variables contain missing values.
To fix this, we will fill the missing values of the numeric variables with the mode (the most frequent value) of each column.
# imputing the missing values of the numerical variables with the column mode
missval_col=['start_station_id','end_station_id','member_birth_year']
for i in missval_col:
    bike_df[i]=bike_df[i].fillna(bike_df[i].mode()[0])
For non-numerical variables like start_station_name and end_station_name we will fill the missing values with the string "None".
# imputing missing values with "None" for the object-type variables.
# assigning the result back avoids calling fillna(inplace=True) on a column selection
bike_df['start_station_name']=bike_df['start_station_name'].fillna("None")
bike_df['end_station_name']=bike_df['end_station_name'].fillna("None")
# For the member_gender column, we will fill the null values with "No Gender"
bike_df['member_gender']=bike_df['member_gender'].fillna("No Gender")
# Checking if the missing values have been imputed.
bike_df.isna().sum()
duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
member_birth_year          0
member_gender              0
bike_share_for_all_trip    0
dtype: int64
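To make the mode-based fill used above concrete, here is a minimal sketch on a made-up column. The values are illustrative only and do not come from the dataset. Note that `mode()` returns a Series because ties are possible, which is why the first element is taken.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the birth-year column (values are invented for illustration).
toy = pd.DataFrame(
    {"member_birth_year": [1988.0, 1988.0, 1992.0, np.nan, 1975.0, np.nan]}
)

# mode() returns a Series (there may be ties); [0] picks the first modal value.
most_common = toy["member_birth_year"].mode()[0]

# Replace every missing entry with that most frequent value.
filled = toy["member_birth_year"].fillna(most_common)

print(most_common)
print(filled.tolist())
```

One consequence worth keeping in mind is that every missing birth year receives the same value, so the imputed rows all pile up at a single age in later plots.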
# Checking the descriptive statistics of the data.
bike_df.describe()
| | duration_sec | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | member_birth_year |
|---|---|---|---|---|---|---|---|---|---|
| count | 183412.000000 | 183412.000000 | 183412.000000 | 183412.000000 | 183412.000000 | 183412.000000 | 183412.000000 | 183412.000000 | 183412.000000 |
| mean | 726.078435 | 138.503866 | 37.771223 | -122.352664 | 136.174743 | 37.771427 | -122.352250 | 4472.906375 | 1984.950347 |
| std | 1794.389780 | 111.750001 | 0.099581 | 0.117097 | 111.478306 | 0.099490 | 0.116673 | 1664.383394 | 9.908290 |
| min | 61.000000 | 3.000000 | 37.317298 | -122.453704 | 3.000000 | 37.317298 | -122.453704 | 11.000000 | 1878.000000 |
| 25% | 325.000000 | 47.000000 | 37.770083 | -122.412408 | 44.000000 | 37.770407 | -122.411726 | 3777.000000 | 1981.000000 |
| 50% | 514.000000 | 104.000000 | 37.780760 | -122.398285 | 100.000000 | 37.781010 | -122.398279 | 4958.000000 | 1988.000000 |
| 75% | 796.000000 | 239.000000 | 37.797280 | -122.286533 | 235.000000 | 37.797320 | -122.288045 | 5502.000000 | 1992.000000 |
| max | 85444.000000 | 398.000000 | 37.880222 | -121.874119 | 398.000000 | 37.880222 | -121.874119 | 6645.000000 | 2001.000000 |
# changing the datatypes of certain selected features of interest to appropriate data types.
bike_df[['start_station_id','end_station_id','bike_id']]= bike_df[['start_station_id','end_station_id','bike_id']].astype('object')
bike_df[['user_type','member_gender']]= bike_df[['user_type','member_gender']].astype('category')
bike_df['member_birth_year']= bike_df['member_birth_year'].astype('int64')
# checking the unique values for member gender.
bike_df['member_gender'].unique()
['Male', 'No Gender', 'Other', 'Female'] Categories (4, object): ['Female', 'Male', 'No Gender', 'Other']
# Convert start time to the period of day: morning, afternoon or night
bike_df['start_time']=pd.to_datetime(bike_df['start_time'])
bike_df['start_hour']=bike_df['start_time'].apply(lambda i : i.hour)
bike_df['period_ofday']='morning'
# use .loc for the assignments to avoid chained indexing (which raises a SettingWithCopyWarning)
bike_df.loc[(bike_df['start_hour'] >=12) & (bike_df['start_hour'] <=17), 'period_ofday'] = 'afternoon'
bike_df.loc[(bike_df['start_hour'] >=18) & (bike_df['start_hour'] <=23), 'period_ofday'] = 'night'
#Testing the start hour and time of day columns created.
print(bike_df['start_hour'].value_counts())
print(bike_df['period_ofday'].value_counts())
17    21864
8     21056
18    16827
9     15903
16    14169
7     10614
19     9881
15     9174
12     8724
13     8551
10     8364
14     8152
11     7884
20     6482
21     4561
6      3485
22     2916
23     1646
0       925
5       896
1       548
2       381
4       235
3       174
Name: start_hour, dtype: int64
afternoon    70634
morning      70465
night        42313
Name: period_ofday, dtype: int64
# convert time period of day into ordered categorical data type.
ordinal_dict = {'period_ofday': ['morning', 'afternoon', 'night']}
for item in ordinal_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True, categories = ordinal_dict[item])
    bike_df[item] = bike_df[item].astype(ordered_var)
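As an alternative to assigning the periods with boolean masks and then converting the column, the same hour ranges can be expressed with `pd.cut`, which returns an ordered categorical in one step. This is only a sketch under the notebook's definitions (hours 0-11 morning, 12-17 afternoon, 18-23 night); the `hours` Series below is a hypothetical stand-in for `bike_df['start_hour']`.

```python
import pandas as pd

# Hypothetical start hours (0-23) standing in for bike_df['start_hour'].
hours = pd.Series([0, 7, 11, 12, 17, 18, 23])

# Intervals are right-inclusive:
# (-1, 11] -> morning, (11, 17] -> afternoon, (17, 23] -> night.
period = pd.cut(
    hours,
    bins=[-1, 11, 17, 23],
    labels=["morning", "afternoon", "night"],
    ordered=True,
)

print(period.tolist())
print(period.cat.ordered)
```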
# creating the age variable/feature
bike_df['age']=bike_df['member_birth_year'].apply(lambda birth_year : 2019 - birth_year )
# Testing if age variable has been added to the dataset.
print(bike_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 19 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             183412 non-null  int64
 1   start_time               183412 non-null  datetime64[ns]
 2   end_time                 183412 non-null  object
 3   start_station_id         183412 non-null  object
 4   start_station_name       183412 non-null  object
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183412 non-null  object
 8   end_station_name         183412 non-null  object
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  object
 12  user_type                183412 non-null  category
 13  member_birth_year        183412 non-null  int64
 14  member_gender            183412 non-null  category
 15  bike_share_for_all_trip  183412 non-null  object
 16  start_hour               183412 non-null  int64
 17  period_ofday             183412 non-null  category
 18  age                      183412 non-null  int64
dtypes: category(3), datetime64[ns](1), float64(4), int64(4), object(7)
memory usage: 22.9+ MB
None
The dataset includes $183\,412$ trips and $16$ features namely:
duration_sec
start_time
end_time
start_station_id
start_station_name
start_station_latitude
start_station_longitude
end_station_id
end_station_name
end_station_latitude
end_station_longitude
bike_id
user_type
member_birth_year
member_gender
bike_share_for_all_trip
The dataset contains 3 types of datatypes namely:
int64
object
float64
The start_time feature was recorded with the object datatype and has now been converted to datetime; end_time is still stored as object, as the info() output above shows. Since we are interested in finding out when most trips are taken, start_time has been broken down into periods of the day: morning, afternoon and night. The member birth year is used to compute the age of each member, so that we can investigate how age relates to the trip duration and to the user type.
The following are the main features of interest.
member_gender : The gender of the member: Male, Female or Other.
member_birth_year : The year of birth of each user.
start_time : The start time for each trip.
The following are the features of interest to help support our investigation.
period_ofday : The period of the day, i.e. morning, afternoon or night. Derived from the start_time variable.
duration_min : The trip duration in minutes.
In this section, we investigate the distributions of individual variables and observe if there are any unusual points or outliers as well as any relationships between variables.
# plotting the distribution for the age of bike users.
plt.figure(figsize=[10, 8],dpi=100)
plt.hist(data = bike_df, x = 'age')
plt.xlabel('Member age')
plt.title('Distribution of Age for the Members')
plt.show()
# plotting the box plot to check the outliers clearly.
plt.figure(figsize=(8,6))
plt.boxplot(bike_df['age']);
# the box plot is vertical, so age is on the y-axis
plt.ylabel('Member Age (Years)')
plt.title('Distribution of Members Age');
We observe that the distribution of members' ages is right-skewed, with the majority of users in the age range of $20-40$. The age variable also contains outliers, as a user older than $100$ years is not plausible.
# displaying the outliers in the age variable.
bike_df.query('age >= 100')
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | start_hour | period_ofday | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1285 | 148 | 2019-02-28 19:29:17.627 | 2019-02-28 19:31:45.9670 | 158.0 | Shattuck Ave at Telegraph Ave | 37.833279 | -122.263490 | 173.0 | Shattuck Ave at 55th St | 37.840364 | -122.264488 | 5391 | Subscriber | 1900 | Male | Yes | 19 | night | 119 |
| 10827 | 1315 | 2019-02-27 19:21:34.436 | 2019-02-27 19:43:30.0080 | 343.0 | Bryant St at 2nd St | 37.783172 | -122.393572 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 6249 | Subscriber | 1900 | Male | No | 19 | night | 119 |
| 16087 | 1131 | 2019-02-27 08:37:36.864 | 2019-02-27 08:56:28.0220 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 36.0 | Folsom St at 3rd St | 37.783830 | -122.398870 | 4968 | Subscriber | 1900 | Male | No | 8 | morning | 119 |
| 19375 | 641 | 2019-02-26 17:03:19.855 | 2019-02-26 17:14:01.6190 | 9.0 | Broadway at Battery St | 37.798572 | -122.400869 | 30.0 | San Francisco Caltrain (Townsend St at 4th St) | 37.776598 | -122.395282 | 6164 | Customer | 1900 | Male | No | 17 | afternoon | 119 |
| 21424 | 1424 | 2019-02-26 08:58:02.904 | 2019-02-26 09:21:47.7490 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 343.0 | Bryant St at 2nd St | 37.783172 | -122.393572 | 5344 | Subscriber | 1900 | Male | No | 8 | morning | 119 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 171996 | 1368 | 2019-02-03 17:33:54.607 | 2019-02-03 17:56:42.9490 | 37.0 | 2nd St at Folsom St | 37.785000 | -122.395936 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 4988 | Subscriber | 1900 | Male | No | 17 | afternoon | 119 |
| 173711 | 993 | 2019-02-03 09:45:30.464 | 2019-02-03 10:02:04.1690 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 36.0 | Folsom St at 3rd St | 37.783830 | -122.398870 | 5445 | Subscriber | 1900 | Male | No | 9 | morning | 119 |
| 177708 | 1527 | 2019-02-01 19:09:28.387 | 2019-02-01 19:34:55.9630 | 343.0 | Bryant St at 2nd St | 37.783172 | -122.393572 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 5286 | Subscriber | 1900 | Male | No | 19 | night | 119 |
| 177885 | 517 | 2019-02-01 18:38:40.471 | 2019-02-01 18:47:18.3920 | 25.0 | Howard St at 2nd St | 37.787522 | -122.397405 | 30.0 | San Francisco Caltrain (Townsend St at 4th St) | 37.776598 | -122.395282 | 2175 | Subscriber | 1902 | Female | No | 18 | night | 117 |
| 182830 | 428 | 2019-02-01 07:45:05.934 | 2019-02-01 07:52:14.9220 | 284.0 | Yerba Buena Center for the Arts (Howard St at ... | 37.784872 | -122.400876 | 67.0 | San Francisco Caltrain Station 2 (Townsend St... | 37.776639 | -122.395526 | 5031 | Subscriber | 1901 | Male | No | 7 | morning | 118 |
72 rows × 19 columns
# dropping the outliers in the age variable.
bike_df.drop(bike_df.query('age >= 100').index, inplace= True)
# Checking if the outliers have been removed.
bike_df['age'].describe()
count    183340.000000
mean         34.016379
std           9.766728
min          18.000000
25%          27.000000
50%          31.000000
75%          38.000000
max          99.000000
Name: age, dtype: float64
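The notebook removes outliers with a fixed domain threshold (age of 100 or more). A common data-driven alternative, and the rule that matplotlib's box plot whiskers use by default, is Tukey's 1.5 × IQR fence. The sketch below applies it to invented ages; it is not the method used in this analysis, only an illustration of how the fence is computed.

```python
import pandas as pd

# Invented ages for illustration; 119 mimics a birth year entered as 1900.
ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 41, 119])

# First and third quartiles, and the interquartile range.
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey fences: points beyond 1.5 * IQR from the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(lower, upper)
print(outliers.tolist())
```

On real age data the IQR fence is usually much stricter than a threshold of 100 years, so it would also flag plausible older riders; that is one reason to prefer a domain threshold here.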
# Using the log scale for the histogram of Member age.
We now change the x-axis to log type, and change the axis limit
# checking the descriptive statistics for age on a log scale.
np.log10(bike_df['age'].describe())
count    5.263257
mean     1.531688
std      0.989749
min      1.255273
25%      1.431364
50%      1.491362
75%      1.579784
max      1.995635
Name: age, dtype: float64
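# Axis transformation for age distribution.
# start the bins at 10**1.25 (about 17.8) so that the minimum age of 18 (log10 = 1.255) falls inside the first bin
bins= 10 ** np.arange(1.25,2+0.025,0.025)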
plt.figure(figsize=(10,8))
plt.hist(data=bike_df,x='age',bins=bins)
plt.xscale('log')
plt.title('Age Distribution on a log Scale');
Upon removing the outliers and plotting the age distribution on a log scale, the distribution appears multimodal, with $3$ peaks. Because ages are whole numbers while the log-spaced bins vary in width, part of this unevenness may be a binning artifact.
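The log-spaced bin edges used for this histogram are built by taking equal steps in the exponent and raising 10 to that power. The sketch below shows the construction on a handful of invented ages. The lower exponent of 1.25 is this sketch's own choice, made so that the minimum age of 18 falls inside the first bin.

```python
import numpy as np

# Equal steps in log10 space: 10**1.25 is about 17.8 and 10**2 is 100.
log_min, log_max, step = 1.25, 2.0, 0.025
edges = 10 ** np.arange(log_min, log_max + step, step)

# Invented ages spanning the range observed after outlier removal (18 to 99).
ages = np.array([18, 21, 27, 31, 38, 45, 60, 99])

# Every age should land in some bin if the edges cover the full range.
counts, _ = np.histogram(ages, bins=edges)

print(edges[:3].round(2))
print(counts.sum(), len(ages))
```

The bin widths grow with age (roughly one year wide near 18 and several years wide near 90), which is worth remembering when reading peaks off an integer-valued variable.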
# plotting the box plot to check if the outliers have been removed.
plt.figure(figsize=(8,6))
plt.boxplot(bike_df['age']);
# the box plot is vertical, so age is on the y-axis
plt.ylabel('Member Age (Years)')
plt.title('Distribution of Members Age');
From the box plot, we see that the extreme outliers have been removed. Only one remaining point lies above $90$ years.
# Plotting the 2 categorical variables user_type and member_gender to get an idea of the distribution of each.
fig, ax = plt.subplots(nrows=2, figsize=[10,10])
plot_color=sb.color_palette()[0]
sb.countplot(data=bike_df, x='user_type',color=plot_color,ax=ax[0])
sb.countplot(data=bike_df, x='member_gender',color=plot_color,ax=ax[1])
plt.show()
According to the plots above, the majority of users are subscribers, while the rest are customers. We also observe that male users are in the majority compared with female users. This matches our expectation that female users would be less likely to book a bike for a trip.
# we first begin by creating a new column called duration_min.
bike_df['duration_min']=bike_df['duration_sec']/60
# checking the summary statistics for trip duration in minutes.
bike_df['duration_min'].describe()
count    183340.000000
mean         12.101764
std          29.911955
min           1.016667
25%           5.416667
50%           8.566667
75%          13.266667
max        1424.066667
Name: duration_min, dtype: float64
# plotting the histogram for trip duration in minutes.
bin_size=300
bins=np.arange(0,bike_df['duration_min'].max()+bin_size, bin_size)
plt.figure(figsize=[10,8],dpi=100)
plt.hist(data=bike_df, x='duration_min',bins=bins)
plt.xlabel('duration (min)')
plt.title('Trip duration in Minutes', fontsize=15)
plt.show()
With $300$-minute bins, most of the trips fall into the first bin, i.e. they last less than $300$ minutes; the mean duration is only about $12$ minutes. To make the plot more readable, we will restrict the x-axis limits.
# plotting the histogram with an addition of the xlim.
plt.figure(figsize=[10,8],dpi=200)
plt.hist(data=bike_df, x='duration_min',bins=100)
plt.xlabel('duration (min)')
plt.xlim((0,100));
plt.title('Trip duration in Minutes', fontsize=15)
plt.show()
Upon using the x-limits, it is now clear that most of the trips do not last more than an hour.
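A visual impression like this one can be backed by a number: the mean of a boolean mask is the share of rows that satisfy the condition. The sketch below uses invented durations in minutes, not the dataset, so the printed share is illustrative only.

```python
import pandas as pd

# Invented trip durations in minutes (stand-in for bike_df['duration_min']).
duration_min = pd.Series([5.4, 8.6, 13.3, 26.4, 61.0, 869.8])

# The mean of a True/False mask equals the fraction of True values.
share_within_hour = (duration_min <= 60).mean()

# A high quantile gives a complementary view of the long right tail.
p95 = duration_min.quantile(0.95)

print(f"{share_within_hour:.1%} of trips last at most 60 minutes")
print(p95)
```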
# plotting the barplot to check the most frequent period of the day when most trips are booked.
freq=bike_df['period_ofday'].value_counts()
order_period=freq.index
plt.figure(figsize=[10,10])
plot_color=sb.color_palette()[0]
sb.countplot(data=bike_df, x='period_ofday',color=plot_color,order=order_period)
plt.show()
Most trips are booked in the afternoon and morning, with fewer at night. It is expected that the majority of users would prefer to book in the morning and afternoon rather than at night.
# plotting the barplot for the period of day with percentages next to the bars.
total_period=bike_df['period_ofday'].value_counts().sum()
plt.figure(figsize=[10,10],dpi=100)
sb.countplot(data=bike_df,y='period_ofday',color='lightgray',order=order_period);
for j in range(freq.shape[0]):
    # positional access, since freq is indexed by the period labels
    count=freq.iloc[j]
    pct_string='{:0.1f}%'.format(100*count/total_period)
    plt.text(count+1,j, pct_string, va="center")
plt.xlabel('Frequency')
plt.ylabel('Period of the day')
plt.title('Period of the day with Most Trips')
From the above plot, we observe that $38.5 \%$ of trips are made in the afternoon, $38.4 \%$ in the morning and $23.1 \%$ during night.
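These percentages can be reproduced directly from the period counts printed earlier in the notebook (afternoon 70634, morning 70465, night 42313, taken before the age outliers were dropped). The sketch below is a small arithmetic check; on the full DataFrame, `value_counts(normalize=True)` would give the same shares without the manual division.

```python
import pandas as pd

# Counts copied from the value_counts() output shown earlier in the notebook.
counts = pd.Series({"afternoon": 70634, "morning": 70465, "night": 42313})

# Share of each period as a percentage of all trips, rounded to one decimal.
pct = (100 * counts / counts.sum()).round(1)

print(counts.sum())
print(pct.to_dict())
```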
#Plotting the pie chart for the top 5 popular start stations.
bike_df['start_station_name'].value_counts()[:5].plot(kind='pie',figsize=(10,10),autopct='%0.1F%%');
plt.title('Top Five Start Stations',fontsize=20);
pd.DataFrame(bike_df['start_station_name'].value_counts()[:5])
| | start_station_name |
|---|---|
| Market St at 10th St | 3904 |
| San Francisco Caltrain Station 2 (Townsend St at 4th St) | 3542 |
| Berry St at 4th St | 3051 |
| Montgomery St BART Station (Market St at 2nd St) | 2895 |
| Powell St BART Station (Market St at 4th St) | 2760 |
The top five start station names are:
Market St at 10th St
San Francisco Caltrain Station 2 (Townsend St at 4th St)
Berry St at 4th St
Montgomery St BART Station (Market St at 2nd St)
Powell St BART Station (Market St at 4th St)
Three of the top five start stations are on Market St, which suggests that the Market Street area is a good place to set up a start station, presumably because it lies at the centre of the business district.
For the univariate visualization, the first variable of interest that was investigated was the member age. The distribution was found to be right-skewed and contained outliers, as was seen from the boxplot. These outliers were dropped and the x-axis was transformed to a log scale with adjusted axis limits. After these transformations, the age distribution appeared multimodal, with $3$ peaks.
The member gender and user type distributions showed that the majority of bike clients were subscribers, while a few of them were customers. It was also observed that Male users were in the majority compared with Female users.
Furthermore, for the trip duration, it was noted that most of the trips did not last more than $60$ minutes. For the period_ofday feature, the period with the most trips was the afternoon with $38.5\%$ of trips, followed by the morning with $38.4\%$ and the night with $23.1\%$.
Among the features investigated, there were some unusual points, such as:
The presence of outliers for the age variable, which were dropped.
Inappropriate datatypes for most features which were changed to suitable datatypes.
Lastly, we had to create some new columns which were useful for our analysis such as age, period_ofday and duration_min which is the trip duration in minutes.
# plotting the barplot for the trip duration for each gender.
plt.figure(figsize=(10,8))
sb.barplot(data=bike_df,x='member_gender',y='duration_min')
plt.xlabel("Member Gender")
plt.ylabel('Trip duration (min)')
plt.title('Duration of trip for each gender in minutes');
From the barplot above, we note that trips for which no gender was provided have the longest average duration, followed by the Other gender. We also note that Female users have a slightly longer average trip duration than Male users.
# Checking the correlation between age and duration_min by plotting the heatmap.
numeric_vars=['age','duration_min']
plt.figure(figsize=(10,8))
sb.heatmap(bike_df[numeric_vars].corr(),annot=True,fmt='.2f',cmap='vlag_r',center=0)
plt.show()
From the heatmap, we observe that there is no correlation between the variables age and trip duration in minutes.
# Scatter plot to Check the correlation between age and duration_min.
plt.figure(figsize=(10,6))
sb.regplot(data=bike_df, x ='age', y ='duration_min');
plt.xlabel('Age (Years)')
plt.ylabel('Trip duration (min)');
# Checking the average trip duration time.
bike_df['duration_min'].describe()
count    183340.000000
mean         12.101764
std          29.911955
min           1.016667
25%           5.416667
50%           8.566667
75%          13.266667
max        1424.066667
Name: duration_min, dtype: float64
The scatter plot also confirms that there is no relationship between age and trip duration in minutes. We had expected that older users would take more time to complete a given trip, since physical strength and speed tend to decline as one grows older.
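The heatmap above reports the Pearson coefficient, which only measures linear association. With a heavily skewed variable such as trip duration, a rank-based (Spearman) coefficient is a useful second check. The sketch below uses synthetic, independently generated data, so both coefficients should be near zero; the variable names and distributions are assumptions for illustration, and Spearman is computed as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Synthetic, independent stand-ins for age and trip duration (not the dataset).
rng = np.random.default_rng(0)
age = pd.Series(rng.integers(18, 80, size=500))
duration = pd.Series(rng.exponential(scale=12, size=500))

# Pearson: linear association on the raw values.
pearson = age.corr(duration)

# Spearman: Pearson correlation computed on the ranks of the values.
spearman = age.rank().corr(duration.rank())

print(round(pearson, 3), round(spearman, 3))
```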
#plotting the relationship between age and user_type with the aid of a violin plot.
plt.figure(figsize = [16, 5],dpi=100)
base_color=sb.color_palette()[0]
plt.subplot(1,2,1)
ax1 = sb.violinplot(data=bike_df, y ='age', x='user_type', color=base_color)
plt.xticks(rotation = 15);
#plotting the relationship between age and user_type with the aid of a boxplot.
plt.subplot(1,2,2)
sb.boxplot(data=bike_df, y = 'age', x='user_type', color=base_color)
plt.xticks(rotation = 15);
plt.ylim(ax1.get_ylim());
From the boxplot and violin plots above, we see that the median age of the Customer user type is slightly lower than that of the Subscriber user type. We also note that subscribers have a higher maximum age, with more extreme outliers, than customers.
# plotting the relationship between the categorical variables period_ofday, user_type and member_gender.
plt.figure(figsize=(10,10),dpi=100)
#Subplot 1: period_ofday vs user_type
plt.subplot(3,1,1)
sb.countplot(data=bike_df, x='period_ofday',hue='user_type',palette='Blues')
#Subplot 2: period_ofday vs member_gender
ax=plt.subplot(3,1,2)
sb.countplot(data=bike_df, x='period_ofday',hue='member_gender',palette='Blues')
ax.legend(ncol=2)
#Subplot 3: user_type vs member_gender
ax=plt.subplot(3,1,3)
sb.countplot(data=bike_df, x='user_type',hue='member_gender',palette='Reds')
ax.legend(loc=2, ncol=2)
plt.show()
There are more subscribers than customers in all $3$ periods of the day.
We also note that there are fewer Female users than Male users in all $3$ periods of the day; Male users are in the majority.
From the relationship between user_type and gender, we note that there are far fewer Female than Male users among both customers and subscribers. We further note that there are fewer Male customers than Male subscribers.
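Because subscribers far outnumber customers, raw counts make it hard to compare the gender mix of the two groups. A normalised cross-tabulation puts both on the same footing. The sketch below uses a few invented rows with the notebook's column names; the shares it prints do not describe the real data.

```python
import pandas as pd

# Invented rows; column names mirror the notebook, values are illustrative.
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 5 + ["Customer"] * 3,
    "member_gender": ["Male", "Male", "Male", "Female", "Other",
                      "Male", "Female", "Male"],
})

# normalize="index" turns each row into shares that sum to 1,
# i.e. the gender composition within each user type.
shares = pd.crosstab(toy["user_type"], toy["member_gender"], normalize="index")

print(shares.round(2))
```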
def chart_plot(x):
    """Count the trips in each category of the feature x and plot a pie chart
    showing the percentage of trips per category.
    """
    # counting the non-null user_type entries gives the number of trips per category of x
    fea_stat=bike_df.groupby([x],as_index=False)["user_type"].count()
    fig=px.pie(fea_stat,names=x,values='user_type',color_discrete_sequence=px.colors.sequential.RdBu,width=700,height=600 )
    return fig
chart_plot('member_gender')
A large percentage of bike users are Males.
chart_plot('period_ofday')
The afternoon and morning period of day are the peak periods of the day with more users for both customer and Subscriber user types.
Some observations on the Bivariate exploration are:
The Male gender has a slightly shorter trip duration time compared to the Females.
Subscribers tend to have a shorter trip duration than customers.
For both user types, Male users are in the majority.
It was interesting to note that the Male gender were in majority across all the $3$ periods of the day namely morning, afternoon and night. We also noted that, for the $3$ periods of the day, there were more subscribers than customers.
Surprisingly, it was observed that there was no correlation between age and duration of the trip. Logically, one might be of the idea that the users who are old will have a longer trip duration.
In this section, we will create plots of three or more variables to investigate further relationships between the different variables of interest.
# plotting the period of day trip duration for each user type using the clustered barplot.
plt.figure(figsize=[10,8],dpi=100)
# errorbar=None (seaborn >= 0.12) replaces the deprecated ci=None and hides the error bars
plot=sb.barplot(data=bike_df, x = 'period_ofday', y='duration_min', hue='user_type',errorbar=None)
plot.set(xlabel="Period of day", ylabel='Trip duration (min)')
plt.title('Period of day duration usage');
For the morning, afternoon and night periods of the day, we note that the average trip duration is longer for the customer user type than for subscribers, with the afternoon period having the highest average trip duration for customers.
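The bar heights in a clustered bar plot like this one are group means, so the same comparison can be read off a small table built with `groupby`. The sketch below uses invented trips with the notebook's column names; the numbers are chosen for easy arithmetic and are not results from the dataset.

```python
import pandas as pd

# Invented trips; only the column names are taken from the notebook.
toy = pd.DataFrame({
    "period_ofday": ["morning", "morning", "morning", "afternoon",
                     "afternoon", "afternoon", "night", "night"],
    "user_type": ["Customer", "Subscriber", "Subscriber", "Customer",
                  "Customer", "Subscriber", "Customer", "Subscriber"],
    "duration_min": [20.0, 9.0, 11.0, 28.0, 32.0, 12.0, 18.0, 8.0],
})

# Mean duration per (period, user type), reshaped so user types become columns.
mean_table = (
    toy.groupby(["period_ofday", "user_type"])["duration_min"]
    .mean()
    .unstack("user_type")
)

print(mean_table)
```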
# Plotting the relationship between member_gender, age and duration_min.
plt.figure(figsize=[10,10],dpi=200);
sb.scatterplot(data=bike_df ,x='age',y='duration_min',hue= 'member_gender', linewidth =0);
We note the following from the above scatter plot:
The trip duration for the majority of Male users is between $0-200$ minutes, as this is where most of the points are clustered.
There are relatively fewer Female users, judging by the blue points that indicate the Female gender, and we can also see that the trip duration is longer for most Female users compared to Male users.
For the age range from $80-100$ we observe that the trip duration is nearly zero, which suggests that these might be outliers.
# plotting the correlation matrix for all the variables in the dataset.
plt.figure(figsize=(10,8))
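# numeric_only=True restricts the correlation to numeric columns (required in pandas >= 2.0)
sb.heatmap(bike_df.corr(numeric_only=True),annot=True,fmt='.2f',cmap='vlag_r',center=0)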
plt.show()
The above heatmap shows the correlation between the numeric features in the dataset. There is a strong correlation between start station longitude and end station longitude. Apart from the trivial perfect correlation between duration_sec and duration_min (one is the other divided by $60$), the duration variables show no correlation with any of the other features. Furthermore, age and member_birth_year are perfectly negatively correlated, because age is computed as $2019$ minus member_birth_year.
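Since age is derived as 2019 minus member_birth_year, and duration_min as duration_sec divided by 60, each derived column is an exact linear function of its source. The Pearson coefficient for such pairs is -1 or +1 by construction, whatever the data. The sketch below checks this on a few birth years and durations taken from the first rows shown at the top of the notebook.

```python
import pandas as pd

# Birth years and durations from the first rows displayed in the notebook.
birth_year = pd.Series([1984, 1972, 1989, 1974])
duration_sec = pd.Series([52185, 61854, 36490, 1585])

# Derived columns, built the same way as in the notebook.
age = 2019 - birth_year
duration_min = duration_sec / 60

# A decreasing linear map gives r = -1; an increasing one gives r = +1.
r_age = age.corr(birth_year)
r_duration = duration_min.corr(duration_sec)

print(round(r_age, 6), round(r_duration, 6))
```

For this reason one column from each derived pair could be left out of the heatmap without losing information.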
# Plotting the relationship between member_gender, age and user_type using a pointplot.
fig = plt.figure(figsize = [10,8],dpi=100)
ax = sb.pointplot(data = bike_df, x = 'member_gender', y = 'age', hue = 'user_type',
palette = 'Blues', linestyles = '', dodge = True,errorbar="sd")
plt.title('Members\' age across gender and user types')
plt.xlabel('Member Gender')
plt.ylabel(' Age (Year)')
plt.show();
The pointplot above shows how member age varies across the member gender and user type variables.
We observe that, for the Female gender, the mean age of customers is lower than that of subscribers. For the Male gender, the mean age of both customers and subscribers is higher than for the Female gender. We further note that, for the Other gender, the mean ages of both customers and subscribers are greater than for the remaining gender categories. In general, customers have a lower mean age than subscribers across all genders.
From the Multivariate visualization, the following were some of the relationships observed:
The customer user type tends to have a longer trip duration than the subscriber user type, with the longest customer trips in the afternoon.
The subscriber user type has a higher mean age across all gender types compared to the customer user type.
There is a strong correlation between start station longitude and end station longitude. Thus, the location of users plays a role in determining the best marketing strategy.
It was surprising to note that members aged between $80-100$ had a lower trip duration, and also that the Other gender had a higher mean age for both the customer and subscriber user types.
Interestingly, and as expected, there are relatively fewer female bike users than male users. Also, Female users have a slightly longer trip duration than Male users.